Introduction

DDSAnalytics is an analytics company that specializes in talent management solutions for Fortune 100 companies. Talent management is defined as the iterative process of developing and retaining employees. The executive leadership has identified predicting employee turnover as its first application of data science for talent management. We have been tasked to conduct an analysis of existing employee data.

We evaluated the employee data, conducted in-depth analysis of various parameters provided to us. We came up with various factors which impacts the employee attrition and as well as factors which can help predict the attrition of an employee. We also built a profile of an employee with high probability of leaving the company.

Exploratory Data Analysis

First we will see a sample of our data to understand what kind of variables we are evaluating.

##   ID Age Attrition    BusinessTravel DailyRate             Department
## 1  1  32        No     Travel_Rarely       117                  Sales
## 2  2  40        No     Travel_Rarely      1308 Research & Development
## 3  3  35        No Travel_Frequently       200 Research & Development
## 4  4  32        No     Travel_Rarely       801                  Sales
## 5  5  24        No Travel_Frequently       567 Research & Development
## 6  6  27        No Travel_Frequently       294 Research & Development
##   DistanceFromHome Education   EducationField EmployeeCount EmployeeNumber
## 1               13         4    Life Sciences             1            859
## 2               14         3          Medical             1           1128
## 3               18         2    Life Sciences             1           1412
## 4                1         4        Marketing             1           2016
## 5                2         1 Technical Degree             1           1646
## 6               10         2    Life Sciences             1            733
##   EnvironmentSatisfaction Gender HourlyRate JobInvolvement JobLevel
## 1                       2   Male         73              3        2
## 2                       3   Male         44              2        5
## 3                       3   Male         60              3        3
## 4                       3 Female         48              3        3
## 5                       1 Female         32              3        1
## 6                       4   Male         32              3        3
##                  JobRole JobSatisfaction MaritalStatus MonthlyIncome
## 1        Sales Executive               4      Divorced          4403
## 2      Research Director               3        Single         19626
## 3 Manufacturing Director               4        Single          9362
## 4        Sales Executive               4       Married         10422
## 5     Research Scientist               4        Single          3760
## 6 Manufacturing Director               1      Divorced          8793
##   MonthlyRate NumCompaniesWorked Over18 OverTime PercentSalaryHike
## 1        9250                  2      Y       No                11
## 2       17544                  1      Y       No                14
## 3       19944                  2      Y       No                11
## 4       24032                  1      Y       No                19
## 5       17218                  1      Y      Yes                13
## 6        4809                  1      Y       No                21
##   PerformanceRating RelationshipSatisfaction StandardHours StockOptionLevel
## 1                 3                        3            80                1
## 2                 3                        1            80                0
## 3                 3                        3            80                0
## 4                 3                        3            80                2
## 5                 3                        3            80                0
## 6                 4                        3            80                2
##   TotalWorkingYears TrainingTimesLastYear WorkLifeBalance YearsAtCompany
## 1                 8                     3               2              5
## 2                21                     2               4             20
## 3                10                     2               3              2
## 4                14                     3               3             14
## 5                 6                     2               3              6
## 6                 9                     4               2              9
##   YearsInCurrentRole YearsSinceLastPromotion YearsWithCurrManager
## 1                  2                       0                    3
## 2                  7                       4                    9
## 3                  2                       2                    2
## 4                 10                       5                    7
## 5                  3                       1                    3
## 6                  7                       1                    7

Our Data set has following variables and records.

## [1] "Number of variables: 36"
## [1] "Number of Records: 870"
## [1] "List of Variables: "
##  [1] "ID"                       "Age"                     
##  [3] "Attrition"                "BusinessTravel"          
##  [5] "DailyRate"                "Department"              
##  [7] "DistanceFromHome"         "Education"               
##  [9] "EducationField"           "EmployeeCount"           
## [11] "EmployeeNumber"           "EnvironmentSatisfaction" 
## [13] "Gender"                   "HourlyRate"              
## [15] "JobInvolvement"           "JobLevel"                
## [17] "JobRole"                  "JobSatisfaction"         
## [19] "MaritalStatus"            "MonthlyIncome"           
## [21] "MonthlyRate"              "NumCompaniesWorked"      
## [23] "Over18"                   "OverTime"                
## [25] "PercentSalaryHike"        "PerformanceRating"       
## [27] "RelationshipSatisfaction" "StandardHours"           
## [29] "StockOptionLevel"         "TotalWorkingYears"       
## [31] "TrainingTimesLastYear"    "WorkLifeBalance"         
## [33] "YearsAtCompany"           "YearsInCurrentRole"      
## [35] "YearsSinceLastPromotion"  "YearsWithCurrManager"

We have to clean up the data now to be able to use if for our analysis. First we will set the correct data type for the variables.

Lets first check the distribution of Monthly Income and distribution of Attrition as they are the data of Interest here.

The Monthly Income looks right skewed. Which is expected of a Income data and its in line with our expectation.

The Attrition data appears to be exteremly unbalanced. We have large number of “No’s” compared to “Yes’s”. Which means the data analysis could suffer from bias.

Before we take any action to balance the data, lets evaluate little more.

Lets check if Monthly Income and Age has any relationship with Attrition.

It looks like the “No” attrition is overwhelming compared to the acutal attrition.

Let’s check one more set of plots. Do we have a relationship between Attrition and the Job Level, Job Role, Job Involvement and Job Satisfaction?

Again its hard to tell if Job Level plays any role at in the Attrition. Because the “No” values are disproportionately higher.

Balancing the Data (Down Sample)

To solve this problem, we will balance the data using Down Sample feature. Where data will be reduced to match the Attrition values of Yes and No.

## [1] "Number of variables: 36"
## [1] "Number of Records: 280"
## [1] "List of Variables: "
##  [1] "ID"                       "Age"                     
##  [3] "Attrition"                "BusinessTravel"          
##  [5] "DailyRate"                "Department"              
##  [7] "DistanceFromHome"         "Education"               
##  [9] "EducationField"           "EmployeeCount"           
## [11] "EmployeeNumber"           "EnvironmentSatisfaction" 
## [13] "Gender"                   "HourlyRate"              
## [15] "JobInvolvement"           "JobLevel"                
## [17] "JobRole"                  "JobSatisfaction"         
## [19] "MaritalStatus"            "MonthlyIncome"           
## [21] "MonthlyRate"              "NumCompaniesWorked"      
## [23] "Over18"                   "OverTime"                
## [25] "PercentSalaryHike"        "PerformanceRating"       
## [27] "RelationshipSatisfaction" "StandardHours"           
## [29] "StockOptionLevel"         "TotalWorkingYears"       
## [31] "TrainingTimesLastYear"    "WorkLifeBalance"         
## [33] "YearsAtCompany"           "YearsInCurrentRole"      
## [35] "YearsSinceLastPromotion"  "YearsWithCurrManager"

Lets look at the Monthly Income and Age in relationship with Attrition again with balanced data.

We can now see a more clear pattern where Lower Income and Yonger Age has clear relationhip to higher Attrition.

Let’s check the relationship between Attrition and the Job Level, Job Role, Job Involvement and Job Satisfaction again.

We can see that Job Level 1, Job Role of Laboratory Technician and Research Scientists, Lower Job Involvement and Lower Job Satisfation leads to higher Attrition rate.

We will continue to do our data analysis using various parameters and plots to see if there is any relationship amoung the different variables with Monthly Income and Attrition.

From above plots we notice that Gender doesn’t play a role in terms of Attrition or Income. However, Stock Option, Over Time are very important factor in Attrition. Marital status also appears to play a role in Attrition.

Heat Maps

To further our analysis and validate the relationship of variables with each other. We will do heatmaps to check how variables are correlated to each other.

This analysis confirms our earlier assessment that Stock Option, Number of Years worked,Monthly Income, Over Time, Marital Status, Job Satisfaction, Job Role and Job Level plays a big role in the Attrition level.

Based on Exploratory Data Analysis we can say that a typical profile of Employee who is likely to leave a company is follows:

  1. Marital Status = Single
  2. Stock Option Level = 0
  3. Monthly Income = <$10,000
  4. Number of Years worked = 4
  5. Over Time = Yes
  6. Job Satisfaction = 1 and 2 (Low)
  7. Job Role = Lab Technician and Research Scientist
  8. Number of Companies = 5 or greater
  9. Age = Below 40

Surprisingly, following factors doesn’t have a bearing on the high Attrition rate:

  1. Percentage Salary Hike
  2. Work Life Balance
  3. Gender

Statistical Analysis - Regression

Monthly Income

After Exploratory Data Analysis, we will move on to Statistical Analysis of our Data.

First we will start with regression of Monthly Income.

To start with, we will remove unwanted variables like: 1. ID - ID shouldn’t matter 2. Employee Count - All values are 1 3. Employee Number - Employee number shouldn’t matter 4. Over 18 - All values are Yes 5. Standard Hours - All values are 80

Once we have the dataset, we will run stepwise regression to identify the significant variables

## 
## Call:
## lm(formula = MonthlyIncome ~ BusinessTravel + DistanceFromHome + 
##     Gender + JobLevel + JobRole + TotalWorkingYears + YearsInCurrentRole, 
##     data = data.red)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2786.5  -621.9   -40.7   546.7  3989.9 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                      2803.511    414.929   6.757 9.17e-11 ***
## BusinessTravelTravel_Frequently   446.795    242.976   1.839  0.06707 .  
## BusinessTravelTravel_Rarely       848.644    208.736   4.066 6.35e-05 ***
## DistanceFromHome                  -21.175      7.427  -2.851  0.00470 ** 
## GenderMale                        188.245    133.238   1.413  0.15889    
## JobLevel2                        1935.318    272.643   7.098 1.19e-11 ***
## JobLevel3                        4746.914    360.499  13.168  < 2e-16 ***
## JobLevel4                        8349.739    535.478  15.593  < 2e-16 ***
## JobLevel5                       10876.864    612.231  17.766  < 2e-16 ***
## JobRoleHuman Resources           -757.796    470.934  -1.609  0.10880    
## JobRoleLaboratory Technician    -1115.079    342.555  -3.255  0.00128 ** 
## JobRoleManager                   3753.452    455.718   8.236 8.62e-15 ***
## JobRoleManufacturing Director     258.640    333.539   0.775  0.43878    
## JobRoleResearch Director         3938.158    403.869   9.751  < 2e-16 ***
## JobRoleResearch Scientist        -803.932    346.872  -2.318  0.02124 *  
## JobRoleSales Executive            364.975    255.826   1.427  0.15487    
## JobRoleSales Representative     -1063.900    382.812  -2.779  0.00585 ** 
## TotalWorkingYears                  31.151     15.624   1.994  0.04722 *  
## YearsInCurrentRole                 35.783     22.207   1.611  0.10830    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1025 on 261 degrees of freedom
## Multiple R-squared:  0.9524, Adjusted R-squared:  0.9491 
## F-statistic: 290.1 on 18 and 261 DF,  p-value: < 2.2e-16
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
## There were missing values in resampled performance measures.
##    nvmax     RMSE  Rsquared       MAE   RMSESD RsquaredSD    MAESD
## 1      1 3407.683 0.5067651 2676.7319 721.0088 0.16867373 481.0300
## 2      2 2942.690 0.5588824 2299.3641 686.4248 0.20866856 405.9648
## 3      3 1831.147 0.8320876 1516.7675 184.5672 0.05348063 182.9415
## 4      4 1300.565 0.9162584  960.7193 224.2638 0.03249893 156.1966
## 5      5 1243.909 0.9271295  933.5867 137.3650 0.02324789 100.6893
## 6      6 1112.919 0.9380030  827.0537 186.0700 0.02286253 134.2642
## 7      7 1081.871 0.9405316  805.2130 193.7222 0.02391327 129.3790
## 8      8 1093.060 0.9401301  821.8018 179.2924 0.02223264 123.2936
## 9      9 1080.926 0.9413322  810.9890 183.5284 0.02284979 123.7845
## 10    10 1073.305 0.9412974  804.6415 194.8817 0.02440464 127.6605
## 11    11 1068.352 0.9416120  798.9384 186.5494 0.02356698 119.6480
## 12    12 1078.962 0.9401214  813.9811 200.8309 0.02528910 137.1344
## 13    13 1065.720 0.9419299  807.3506 194.5372 0.02464378 139.9376
## 14    14 1086.219 0.9400449  825.5309 188.3027 0.02509340 145.9595
## 15    15 1100.238 0.9388400  838.3119 183.7699 0.02544104 145.7394
## 
## Call:
## lm(formula = MonthlyIncome ~ BusinessTravel + DistanceFromHome + 
##     JobLevel + JobRole + TotalWorkingYears, data = data.red)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3103.6  -615.1   -62.6   548.1  3772.3 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                      2943.693    403.992   7.287 3.71e-12 ***
## BusinessTravelTravel_Frequently   402.277    242.793   1.657 0.098738 .  
## BusinessTravelTravel_Rarely       797.181    207.626   3.840 0.000155 ***
## DistanceFromHome                  -22.647      7.417  -3.053 0.002495 ** 
## JobLevel2                        1982.965    270.742   7.324 2.94e-12 ***
## JobLevel3                        4840.400    355.287  13.624  < 2e-16 ***
## JobLevel4                        8392.418    536.238  15.651  < 2e-16 ***
## JobLevel5                       10924.085    614.004  17.792  < 2e-16 ***
## JobRoleHuman Resources           -697.920    471.746  -1.479 0.140219    
## JobRoleLaboratory Technician    -1027.635    341.026  -3.013 0.002836 ** 
## JobRoleManager                   3695.383    456.280   8.099 2.08e-14 ***
## JobRoleManufacturing Director     275.794    333.423   0.827 0.408897    
## JobRoleResearch Director         3866.907    403.800   9.576  < 2e-16 ***
## JobRoleResearch Scientist        -745.963    346.487  -2.153 0.032233 *  
## JobRoleSales Executive            403.523    255.766   1.578 0.115835    
## JobRoleSales Representative     -1012.452    382.174  -2.649 0.008557 ** 
## TotalWorkingYears                  41.460     14.492   2.861 0.004565 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1029 on 263 degrees of freedom
## Multiple R-squared:  0.9517, Adjusted R-squared:  0.9487 
## F-statistic: 323.7 on 16 and 263 DF,  p-value: < 2.2e-16

Based on the regression analysis, we identify the following variables to have significant impact on the Monthly Income:

  1. Business Travel
  2. Distance From Home
  3. Job Level
  4. Job Role
  5. Total Working Years

We will use these variables for our prediction.

Let’s first validate the model for Monthly Income in terms of Error (Root Mean Square Error)

## [1] "Root Mean Square Error: 1162.83567849355"
RMSE is good (less than $3,000), we can use this model to predict the competition set.

###Attrition

Now, lets run the statistical Analysis for Attrition.

We will use two different approaches for analysia and prediction of Attrition:

  1. Logistic regression - For analysis and Prediction
  2. K- Nearest Neighbours

To start with, we will run Logistic Regression to identify the significant variables that contribute towards Attrition.

Logistic Regression

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## 
## Call:
## glm(formula = Attrition ~ ., family = binomial(link = "logit"), 
##     data = data.red)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.3656  -0.2508   0.0000   0.2640   3.5295  
## 
## Coefficients:
##                                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                      -4.413e+00  2.186e+03  -0.002  0.99839    
## Age                              -2.171e-02  3.927e-02  -0.553  0.58029    
## BusinessTravelTravel_Frequently   3.450e-01  1.229e+00   0.281  0.77899    
## BusinessTravelTravel_Rarely       3.908e-02  1.126e+00   0.035  0.97232    
## DailyRate                         1.678e-04  7.516e-04   0.223  0.82338    
## DepartmentResearch & Development  4.006e+01  2.458e+03   0.016  0.98700    
## DepartmentSales                   3.969e+01  2.458e+03   0.016  0.98712    
## DistanceFromHome                  1.195e-01  4.441e-02   2.690  0.00714 ** 
## Education2                        8.002e-01  1.022e+00   0.783  0.43364    
## Education3                        1.960e+00  8.957e-01   2.188  0.02865 *  
## Education4                        9.234e-01  9.249e-01   0.998  0.31810    
## Education5                       -4.343e+00  1.770e+00  -2.453  0.01415 *  
## EducationFieldLife Sciences      -2.437e+01  1.124e+03  -0.022  0.98270    
## EducationFieldMarketing          -2.504e+01  1.124e+03  -0.022  0.98223    
## EducationFieldMedical            -2.434e+01  1.124e+03  -0.022  0.98273    
## EducationFieldOther              -2.389e+01  1.124e+03  -0.021  0.98304    
## EducationFieldTechnical Degree   -2.353e+01  1.124e+03  -0.021  0.98330    
## EnvironmentSatisfaction2         -2.351e+00  1.068e+00  -2.201  0.02772 *  
## EnvironmentSatisfaction3         -6.505e-01  8.794e-01  -0.740  0.45948    
## EnvironmentSatisfaction4         -1.400e+00  8.278e-01  -1.691  0.09081 .  
## GenderMale                       -1.048e+00  7.415e-01  -1.413  0.15773    
## HourlyRate                        1.289e-02  1.631e-02   0.791  0.42910    
## JobInvolvement2                  -1.332e+00  1.408e+00  -0.946  0.34410    
## JobInvolvement3                  -3.745e+00  1.312e+00  -2.855  0.00431 ** 
## JobInvolvement4                  -5.266e+00  1.640e+00  -3.211  0.00132 ** 
## JobLevel2                        -2.901e+00  1.496e+00  -1.939  0.05245 .  
## JobLevel3                        -2.554e+00  2.364e+00  -1.080  0.28013    
## JobLevel4                        -3.542e+00  3.584e+00  -0.988  0.32305    
## JobLevel5                         2.602e+00  4.542e+00   0.573  0.56674    
## JobRoleHuman Resources            3.527e+01  2.458e+03   0.014  0.98855    
## JobRoleLaboratory Technician     -2.616e+00  1.871e+00  -1.398  0.16205    
## JobRoleManager                   -1.626e+00  2.929e+00  -0.555  0.57894    
## JobRoleManufacturing Director    -3.704e+00  2.104e+00  -1.761  0.07831 .  
## JobRoleResearch Director         -6.273e+00  3.088e+00  -2.032  0.04220 *  
## JobRoleResearch Scientist        -4.210e+00  1.881e+00  -2.238  0.02520 *  
## JobRoleSales Executive            5.940e-01  3.279e+00   0.181  0.85622    
## JobRoleSales Representative       2.080e-02  3.475e+00   0.006  0.99522    
## JobSatisfaction2                 -2.114e+00  1.039e+00  -2.035  0.04182 *  
## JobSatisfaction3                 -1.158e+00  8.886e-01  -1.303  0.19243    
## JobSatisfaction4                 -2.657e+00  9.841e-01  -2.700  0.00694 ** 
## MaritalStatusMarried              1.009e+00  1.242e+00   0.813  0.41629    
## MaritalStatusSingle              -1.758e-01  1.600e+00  -0.110  0.91255    
## MonthlyIncome                    -2.927e-04  2.870e-04  -1.020  0.30778    
## MonthlyRate                      -1.042e-05  4.367e-05  -0.239  0.81144    
## NumCompaniesWorked1               2.530e+00  1.046e+00   2.418  0.01560 *  
## NumCompaniesWorked2               2.719e+00  1.497e+00   1.816  0.06932 .  
## NumCompaniesWorked3               1.901e+00  1.621e+00   1.173  0.24080    
## NumCompaniesWorked4               2.363e+00  1.604e+00   1.474  0.14060    
## NumCompaniesWorked5               3.867e+00  1.759e+00   2.198  0.02794 *  
## NumCompaniesWorked6               5.830e+00  2.317e+00   2.516  0.01186 *  
## NumCompaniesWorked7               1.038e+00  1.548e+00   0.671  0.50234    
## NumCompaniesWorked8               2.111e+00  1.726e+00   1.223  0.22138    
## NumCompaniesWorked9               5.996e+00  2.551e+00   2.350  0.01875 *  
## OverTimeYes                       2.800e+00  7.013e-01   3.993 6.54e-05 ***
## PercentSalaryHike                -1.011e-01  1.353e-01  -0.748  0.45459    
## PerformanceRating4                4.838e-01  1.337e+00   0.362  0.71740    
## RelationshipSatisfaction2        -3.772e+00  1.187e+00  -3.177  0.00149 ** 
## RelationshipSatisfaction3        -2.534e+00  9.873e-01  -2.567  0.01027 *  
## RelationshipSatisfaction4        -1.245e+00  8.737e-01  -1.425  0.15410    
## StockOptionLevel1                -3.364e+00  1.183e+00  -2.843  0.00447 ** 
## StockOptionLevel2                -6.063e+00  1.965e+00  -3.086  0.00203 ** 
## StockOptionLevel3                 2.458e+00  1.932e+00   1.273  0.20312    
## TotalWorkingYears                 1.906e-02  1.048e-01   0.182  0.85572    
## TrainingTimesLastYear            -7.689e-02  2.383e-01  -0.323  0.74693    
## WorkLifeBalance2                 -1.579e+00  1.376e+00  -1.148  0.25107    
## WorkLifeBalance3                 -2.086e+00  1.351e+00  -1.544  0.12270    
## WorkLifeBalance4                 -2.083e+00  1.484e+00  -1.403  0.16062    
## YearsAtCompany                    2.111e-01  1.559e-01   1.354  0.17579    
## YearsInCurrentRole               -2.927e-01  2.117e-01  -1.383  0.16672    
## YearsSinceLastPromotion           2.957e-01  1.429e-01   2.069  0.03857 *  
## YearsWithCurrManager             -2.527e-01  1.942e-01  -1.301  0.19321    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 388.16  on 279  degrees of freedom
## Residual deviance: 135.65  on 209  degrees of freedom
## AIC: 277.65
## 
## Number of Fisher Scoring iterations: 16

Based on the significant variables identified in initial run of Logistic Regression model. Make a subset of the variables from statistical analysis and EDA.

Now, Let’s validate the model.

## [1] "Accuracy =  71.4285714285714 %"
## [1] "Sensitivity =  73.3333333333333 %"
## [1] "Specificity =  69.2307692307692 %"

The Accuracy, Sensitivity and Specificity looks good. Now lets try and run KNN and check the prediction accuracy.

KNN

In order to use categorical variables for KNN prediction, first we need to convert them into Integers. Once we have all numeric and integer variables, we will standardize (normalize) then for prediction. To remove the influence of a parameter being in larger scale.

First we will identify the best value of K based on the Accuracy. By running 100 iterations.

## [1] "Best Accuracy at K =   19 of 70.2380952380952 %"
## [1] "Mean Accuracy =  65.7857142857143 %"
## [1] "Mean Sensitivity =  82.7111111111111 %"
## [1] "Mean Specificity =  46.2564102564103 %"

Since iteration recommended K=19. We will run KNN for K=19.

## [1] "Accuracy for K-19 =  70.2380952380952 %"
## [1] "Sensitivity for K-19 =  84.4444444444444 %"
## [1] "Specificity for K-19 =  53.8461538461538 %"

For K=19, we got good Accuracy (above threshold of 60%). However, the Specificity is comparatively lower and marginally under 60%.

Comparing the results from KNN and Logistic Regression. We decide to go ahead with Logistic Regression Model.

Results of Predictive Model using Logistic Regression are printed on the Case2PredictionsApurv Attrition.csv file.

This concludes the analysis for this project.